Abstract. Trust region algorithms are an important class of methods that can be used to solve unconstrained optimization problems. Moré [10] has proven a global convergence result for a class of trust region methods where the gradient values are approximated rather than computed exactly, provided the approximations are consistent. We show that the assumption of consistency can be replaced by a simple condition on the relative error in the gradient approximation. This new condition has both practical and theoretical advantages. First, it provides a practical test for judging the adequacy of a given gradient approximation, and does not require new approximations to be computed for unsuccessful iterations. Second, it leads to stronger convergence results than obtained in [10].

1. Introduction


1.1. Trust region algorithms using inexact gradients. This paper considers trust region methods for the solution of the unconstrained optimization problem

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \; f(x), \tag{1.1}$$

with $f : \mathbb{R}^n \to \mathbb{R}^1$. These methods generate iterates $\{x_k\}$ by producing and approximately solving a sequence of constrained quadratic model problems. That is, $x_{k+1} = x_k + s_k$ for a step $s_k$ that approximately solves

$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \; \psi_k(x_k + s) : \|D_k s\| \le \Delta_k, \tag{1.2}$$

where $D_k \in \mathbb{R}^{n \times n}$ is a scaling matrix, $\Delta_k$ is a positive variable known as the trust radius, and $\psi_k$ is a quadratic model of $f$ about the point $x_k$:

$$\psi_k(x_k + s) = f(x_k) + g_k^T s + \tfrac{1}{2} s^T B_k s. \tag{1.3}$$

The vector $g_k \in \mathbb{R}^n$ is thus the gradient of $\psi_k$ at $x_k$ and the symmetric matrix $B_k \in \mathbb{R}^{n \times n}$ is the Hessian of $\psi_k$. Ideally, $g_k$ should be identical to $\nabla f(x_k)$ (the gradient of $f$ at $x_k$) while $B_k$ should be identical to $\nabla^2 f(x_k)$ (the Hessian of $f$ at $x_k$), but it may not be practical to compute these quantities exactly.

Strong global convergence results have been shown for trust region algorithms which take $g_k = \nabla f(x_k)$ (see, for example, [1], [3], [7], [13], and [14]). If the sequence of Hessian approximations $\{B_k\}$ is uniformly bounded, mild conditions on $f$ and $\{D_k\}$ are sufficient to establish that

$$\lim_{k \to \infty} \|\nabla f(x_k)\| = 0 \tag{1.4}$$

for most implementations.

Moré [10] considers the global convergence of a class of trust region algorithms in which the condition $g_k = \nabla f(x_k)$ is relaxed. Instead of requiring exact gradient values, Moré allows $g_k$ to be an approximation to $\nabla f(x_k)$ provided the sequence of approximations satisfies the consistency property

$$x_k \to x^* \;\Longrightarrow\; \lim_{k \to \infty} \|g_k - \nabla f(x_k)\| = 0. \tag{1.5}$$

Using (1.5) as a primary assumption, he is able to establish¹

$$\liminf_{k \to \infty} \|g_k\| = 0 \tag{1.6}$$

but provides no suggestions about how (1.5) should be enforced.

In this paper, we show that the consistency assumption (1.5) can be replaced by a condition on the relative error in the gradient approximation. In simplest form,² this condition is

$$\frac{\|g_k - \nabla f(x_k)\|}{\|g_k\|} \le \zeta \quad \forall\, k \tag{1.7}$$

for some constant³ $\zeta < 1$. Since error estimates are often available when approximate gradients are calculated, (1.7) provides a practical test for judging the adequacy of a given gradient approximation. We argue that this is a more natural approach than trying to enforce (1.5) directly. Furthermore, (1.7) leads to the global convergence result

$$\lim_{k \to \infty} \|g_k\| = 0, \tag{1.8}$$

which is much stronger than (1.6) under certain conditions.⁴ Moreover, (1.7) and (1.8) imply

$$\lim_{k \to \infty} \|g_k - \nabla f(x_k)\| = 0, \tag{1.9}$$

which is an even stronger consistency property than (1.5). Consistency of the gradient approximations is therefore a consequence of our theory rather than an assumption.
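As a rough illustration of how (1.7) might be used as an acceptance test, the sketch below compares an error estimate against the norm of the approximate gradient. It assumes that an upper bound on $\|g_k - \nabla f(x_k)\|$ is available from whatever procedure produced $g_k$; the function name, the `error_bound` argument, and the handling of $g_k = 0$ are illustrative choices, not part of the paper. The default $\zeta = 0.9$ is the typical value mentioned in the footnotes.

```python
import math


def gradient_is_adequate(g, error_bound, zeta=0.9):
    """Check the relative error test (1.7) in the case D_k = I.

    g           -- approximate gradient g_k (sequence of floats)
    error_bound -- assumed upper bound on ||g_k - grad f(x_k)||
    zeta        -- relative error tolerance, must be < 1
    """
    g_norm = math.sqrt(sum(c * c for c in g))
    if g_norm == 0.0:
        # With g_k = 0 the ratio is undefined; accept only an exact gradient.
        return error_bound == 0.0
    return error_bound / g_norm <= zeta


# ||g|| = 5: an error bound of 4 passes (4/5 = 0.8), one of 5 does not.
print(gradient_is_adequate([3.0, 4.0], 4.0))
print(gradient_is_adequate([3.0, 4.0], 5.0))
```

If the test fails, the approximation would be recomputed to higher accuracy before the iteration proceeds.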


¹ More specifically, he establishes $\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0$, where the elliptical norm $\|x\|_A$ is defined to be $\|x\|_A = (x^T A x)^{1/2}$ for symmetric positive definite $A \in \mathbb{R}^{n \times n}$. For implementations which require $\|D_k^T D_k\| \le \sigma_1^2$ and $\|(D_k^T D_k)^{-1}\| \le \sigma_2^2$ for constants $\sigma_1, \sigma_2$, Moré's result is thus equivalent to $\liminf_{k \to \infty} \|g_k\| = 0$.

² This form is valid for $D_k = I$; slightly different forms of this condition will be used for the more general case $D_k \ne I$.

³ The value of $\zeta$ will depend upon some of the other parameters in the trust region implementation, but will typically be about 0.9.

⁴ Notice that (1.7) and (1.8) imply $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$, the same strong global convergence result obtained when $g_k = \nabla f(x_k)$. If $\{x_k\}$ converges, then (1.5) and (1.6) also imply this strong result, but if $\{x_k\}$ is unbounded or has more than one limit point, then (1.5) and (1.6) do not even imply $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$.

1.2. Structure of trust region algorithms. Before presenting justification for our claim that (1.7) is a more practical condition to directly enforce than (1.5), the structure of trust region algorithms must be described in more detail. Authors typically describe trust region algorithms by using either a single loop indexing system (see, for example, [10] and [13]) or a nested loop indexing system (see, for example, [1], [14], and [15]). For methods that take $g_k = \nabla f(x_k)$, the differences between these two indexing systems are purely semantic, but for $g_k \ne \nabla f(x_k)$ it is instructive to consider them separately. The single loop structure used in [10] is as follows.

Algorithm (1). Single loop structure for the trust region method.

Let $0 < \eta_1 < \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2$ be prespecified.⁵ Select an initial guess $x_0 \in \mathbb{R}^n$ and trust radius $\Delta_0 > 0$. Compute $f(x_0)$, and compute or initialize $g_0$, $B_0$, and $D_0$.

For $k = 0, 1, \ldots$, until "convergence" do:

(a) Determine an approximate solution $s_k$ to problem (1.2).

(b) Compute $\rho_k = (f(x_k) - f(x_k + s_k)) / (\psi_k(x_k) - \psi_k(x_k + s_k))$.

(c) If $\rho_k < \eta_1$ then set $s_k = 0$ and $\Delta_{k+1} \in (0, \gamma_1 \Delta_k]$.

(d) If $\eta_1 \le \rho_k < \eta_2$ then set $\Delta_{k+1} \in [\gamma_1 \Delta_k, \Delta_k]$.

(e) If $\eta_2 \le \rho_k$ then set $\Delta_{k+1} \in [\Delta_k, \gamma_2 \Delta_k]$.

(f) Set $x_{k+1} = x_k + s_k$ and update $g_k$, $B_k$, and $D_k$.

End loop.

In this structure, trial steps $s_k$ are rejected and the trust radius is reduced if $\rho_k < \eta_1$. Such an iteration is called unsuccessful since $x_{k+1} = x_k$; iterations for which $\rho_k \ge \eta_1$ are called successful. Clearly, step (c) is designed to prevent an infinite series of unsuccessful iterations, while steps (d) and (e) are designed to pick a trust radius for the next iteration that is small enough to have a good chance of producing a successful step yet large enough to permit rapid convergence.
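The single loop structure can be sketched in a few lines. The version below is only an illustration under simplifying assumptions that are not part of the paper: it fixes $B_k = I$ and $D_k = I$, so the model minimizer inside the trust region is a multiple of $-g_k$; it always picks one particular radius from each allowed interval; and the names `trust_region_single_loop`, `half_squared_norm`, and `inexact_gradient` are hypothetical. The parameter defaults are the typical values quoted in the footnotes.

```python
import math


def trust_region_single_loop(f, grad_approx, x0, delta0=1.0,
                             eta1=0.001, eta2=0.1, gamma1=0.25, gamma2=4.0,
                             tol=1e-8, max_iter=200):
    """Sketch of the single loop structure with B_k = I and D_k = I."""
    x, delta = list(x0), delta0
    g = grad_approx(x)
    for _ in range(max_iter):
        g_norm = math.sqrt(sum(c * c for c in g))
        if g_norm <= tol:                       # stand-in "convergence" test
            break
        # (a) With B = I the constrained model minimizer is -t*g, t <= 1.
        t = min(1.0, delta / g_norm)
        s = [-t * c for c in g]
        # (b) Ratio of actual to predicted reduction.
        pred = (t - 0.5 * t * t) * g_norm ** 2  # -g^T s - 0.5 s^T s
        trial = [xi + si for xi, si in zip(x, s)]
        rho = (f(x) - f(trial)) / pred
        if rho < eta1:
            # (c) Unsuccessful: keep x and g, shrink the radius.
            delta *= gamma1
        else:
            # (d) keeps the radius, (e) enlarges it; (f) accepts the step.
            if rho >= eta2:
                delta *= gamma2
            x = trial
            g = grad_approx(x)
    return x


def half_squared_norm(x):
    return 0.5 * sum(c * c for c in x)


def inexact_gradient(x):
    # Exact gradient is x; this approximation has relative error of about 5%.
    return [0.95 * c for c in x]


x_final = trust_region_single_loop(half_squared_norm, inexact_gradient,
                                   [3.0, 4.0])
print(x_final)
```

In this sketch the gradient approximation is left untouched at unsuccessful iterations, which is one of the choices the single loop structure permits.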

The structure of Algorithm (1) neither requires nor prohibits updates of $g_k$, $B_k$, and $D_k$ at unsuccessful iterations. In implementations which take $g_k = \nabla f(x_k)$, such updates are rarely found in the literature.⁶ The following algorithm uses a nested loop structure in which the outer loop indexes only successful iterations and updates to $g_k$, $B_k$, and $D_k$ are not allowed during the inner loop.

⁵ Typical values for these parameters are $\eta_1 = 0.001$, $\eta_2 = 0.1$, $\gamma_1 = 0.25$, $\gamma_2 = 4.0$.

Algorithm (2). Nested loop structure for the trust region method.

Let $0 < \eta_1 < \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2$ be prespecified. Select an initial guess $x_0 \in \mathbb{R}^n$ and trust radius $\Delta_0 > 0$. Compute $f(x_0)$, and compute or initialize $g_0$, $B_0$, and $D_0$.

For $k = 0, 1, \ldots$ until "convergence" do:

(a) Repeat until $\rho_k \ge \eta_1$:

(a.1) Determine an approximate solution $s_k$ to problem (1.2).

(a.2) Compute $\rho_k = (f(x_k) - f(x_k + s_k)) / (\psi_k(x_k) - \psi_k(x_k + s_k))$.

(a.3) If $\rho_k < \eta_1$ then set $\Delta_k \in (0, \gamma_1 \Delta_k]$.

End loop.

(b) If $\rho_k < \eta_2$ then set $\Delta_{k+1} \in (0, \Delta_k]$, else set $\Delta_{k+1} \in [\Delta_k, \gamma_2 \Delta_k]$.

(c) Set $x_{k+1} = x_k + s_k$ and update $g_k$, $B_k$, and $D_k$.

End loop.

The form of Algorithm (2) raises the possibility that at some iteration $k$, the inner loop (a) may fail to generate an acceptable new iterate. Consider, for example,⁷ an initial gradient approximation $g_0 = -\nabla f(x_0)$ with $B_0 = D_0 = I$. Since every descent direction for $f$ is an ascent direction for $\psi_0$, $\rho_0$ will be negative⁸ no matter how much $\Delta_0$ is reduced in the inner loop (a). Such failures of the inner loop to converge can occur at any iteration unless:

(i) additional conditions are imposed on $g_k$, or

(ii) $g_k$ is successively improved in the inner loop so that $\lim \|g_k - \nabla f(x_k)\| = 0$.

⁶ This is not to say they are unimportant. The well known algorithm NL2SOL [6] for the solution of the nonlinear least squares problem owes much of its success to its capability of switching between alternate Hessian approximations. Global convergence theory for such switching is given in [1] and [4] in sufficient generality to provide a framework for an expert systems approach to optimization. However, [1], [4], and [6] all take $g_k = \nabla f(x_k)$, which makes the question of updating $g_k$ at unsuccessful iterations moot.

⁷ This example is presented in greater detail in Section 3 of this paper.

⁸ Although this example depends on the angle between $g_k$ and $\nabla f(x_k)$ being greater than ninety degrees, another example will be presented in Section 3 that demonstrates the possibility of failure in the inner loop even if the angle between

*—•00 

The latter approach is implicit in the formal statement of Moré's algorithm, but we prefer imposing the additional condition

$$\frac{\|g_k - \nabla f(x_k)\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta. \tag{1.10}$$

We show in this paper that if $\zeta \in [0, 1 - \eta_1)$, then (1.10) is sufficient to ensure the success of the inner loop. If error estimates are available, (1.10) can be checked at the start of every iteration $k$ and $g_k$ can be recomputed if necessary. Trying to use approach (ii) so that the analysis of [10] holds is less practical because it involves recomputing $g_k$ with successively greater accuracy as $\Delta_k$ decreases in the inner loop without regard for whether error in $g_k$ is the problem or whether $\Delta_k$ is really too large. Since unsuccessful steps are quite common even with $g_k = \nabla f(x_k)$ and since recomputing $g_k$ with successively greater accuracy is generally very expensive computationally, this approach is much less satisfying than using (1.10).


Even if all the iterates are acceptable, directly enforcing the consistency condition (1.5) presents a practical difficulty in that no specification is made about how fast to force $\{\|g_k - \nabla f(x_k)\|\}$ to converge to zero. One might enforce the condition

$$\|g_k - \nabla f(x_k)\| \le c \, \|g_k\| \tag{1.11}$$

for some constant $c \in (0, \infty)$, but (1.5) provides no suggestions for selecting a reasonable value for $c$. On the other hand, a reasonable value for $\zeta$ in (1.10) is much easier to select since we show that strong global convergence results can be obtained for any $\zeta \in [0, 1 - \eta_2)$.


1.3. Synopsis. In Section 2 of this paper, we briefly discuss the techniques generally used to compute trial steps for a given model, scaling matrix, and trust radius. In Section 3, we present two detailed examples of how the inner loop of Algorithm (2) can indeed fail to produce a solution. We then show that condition (1.10) with $\zeta < 1 - \eta_1$ is sufficient to ensure the success of the inner loop. In Section 4, we show that (1.10) with $\zeta < 1 - \eta_2$ is sufficient to establish $\liminf_{k \to \infty} \|g_k\| = \liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$ provided $\{B_k\}$, $\{D_k^T D_k\}$ and $\{(D_k^T D_k)^{-1}\}$ are uniformly bounded. We then demonstrate two ways that the stronger convergence result $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ can be obtained using (1.10) with $\zeta < 1 - \eta_2$ given that $\{B_k\}$ is uniformly bounded and $\{D_k\}$ satisfies some mild assumption. The final section of this paper summarizes our results and suggests some possibilities for future study.

⁸ (continued) $g_k$ and $\nabla f(x_k)$ is zero.

1.4. Nomenclature and standard assumptions. In addition to the notation already introduced, the following definitions and conventions are used throughout this paper. Unless otherwise specified, $\|\cdot\|$ denotes the Euclidean norm (or the matrix norm induced by the Euclidean norm), while $\|x\|_A$ is the elliptical norm $(x^T A x)^{1/2}$ for $A$ a symmetric positive definite matrix in $\mathbb{R}^{n \times n}$. A function $h : \mathbb{R}^n \to \mathbb{R}^m$ is said to be Lipschitz with constant $L$ in an open convex region $\Omega$ if $\|h(x) - h(y)\| \le L \|x - y\|$ for all $x, y \in \Omega$. The level set of a function $f$ at a point $x_k \in \mathbb{R}^n$ is the set of all $x \in \mathbb{R}^n$ such that $f(x) \le f(x_k)$.

Let $\Omega$ be an open convex set containing the level set of $f$ at $x_0$. The function $f : \mathbb{R}^n \to \mathbb{R}$ is said to satisfy the standard assumptions if

$f$ is continuously differentiable on $\Omega$, (1.12a)

$f$ is bounded below, and (1.12b)

$\nabla f$ is Lipschitz with constant $L$ in $\Omega$. (1.12c)

It is frequently convenient to represent the trust region subproblem in local coordinates. We define the predicted function reduction $\mathrm{pred}_k(s)$ as

$$\mathrm{pred}_k(s) \equiv \psi_k(x_k) - \psi_k(x_k + s) = -g_k^T s - \tfrac{1}{2} s^T B_k s, \tag{1.13}$$

and the actual function reduction $\mathrm{ared}_k(s)$ is defined

$$\mathrm{ared}_k(s) \equiv f(x_k) - f(x_k + s). \tag{1.14}$$
Although the notation used in Algorithm (2) is usually convenient, a rigorous treatment of the success or failure of the inner loop requires indexing the trial steps and trial trust radii within the loop. The following algorithm is completely equivalent to Algorithm (2), but introduces some additional nomenclature. Specifically, $\{s^i\}$ represents the complete sequence of trial steps generated and $\{\Delta^i\}$ represents the corresponding trial trust radii, so that $\{s_k\} \subset \{s^i\}$ and $\{\Delta_k\} \subset \{\Delta^i\}$.

Algorithm (3). Trust region method with full notation.

Let $0 < \eta_1 < \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2$ be prespecified. Select an initial guess $x_0 \in \mathbb{R}^n$ and trust radius $\Delta^0 > 0$. Compute $f(x_0)$, and compute or initialize $g_0$, $B_0$, and $D_0$. Set $i = -1$.

For $k = 0, 1, \ldots$, until "convergence" do:

(a) Repeat until $\rho^i \ge \eta_1$:

(a.1) Increment $i$ and determine an approximate solution $s^i$ to

$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \; \psi_k(x_k + s) : \|D_k s\| \le \Delta^i. \tag{1.15}$$

(a.2) Compute

$$\rho^i = \mathrm{ared}_k(s^i) / \mathrm{pred}_k(s^i). \tag{1.16}$$

(a.3) If $\rho^i < \eta_1$ then set $\Delta^{i+1} \in (0, \gamma_1 \Delta^i]$,

Else set $s_k = s^i$, $\Delta_k = \Delta^i$, and $\rho_k = \rho^i$.

End loop.

(b) If $\rho^i < \eta_2$ then set $\Delta^{i+1} \in (0, \Delta^i]$,

Else set $\Delta^{i+1} \in [\Delta^i, \gamma_2 \Delta^i]$.

(c) Set $x_{k+1} = x_k + s_k$ and update $g_k$, $B_k$, and $D_k$.

End loop.
2. Computation of trial steps.


2.1.  Introduction.  In  order  to  establish  our  results,  we  must  use  several  properties  satisfied  by 
standard  techniques  for  computing  trial  steps.  This  section  summarizes  these  properties,  but  is  not 
intended  to  be  a  comprehensive  discussion  of  methods  of  step  computation.  An  excellent  survey 
(with  an  extensive  bibliography)  of  the  many  step  computation  strategies  is  presented  in  [10]. 
Readers  familiar  with  these  techniques  may  wish  to  proceed  directly  to  Section  3. 


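2.2. Scaling matrices and preconditioning. For any nonsingular $D_k \in \mathbb{R}^{n \times n}$ consider the change of variables

$$\tilde{x} = D_k x \tag{2.1}$$

so that $\tilde{s} = D_k s$ and $\tilde{x}_k = D_k x_k$. Then the definitions

$$\tilde{\psi}_k(\tilde{x}_k + \tilde{s}) = \psi_k(x_k + s), \tag{2.2}$$

$$\widetilde{\mathrm{pred}}_k(\tilde{s}) = \mathrm{pred}_k(s), \tag{2.3}$$

and

$$\widetilde{\mathrm{ared}}_k(\tilde{s}) = \mathrm{ared}_k(s) \tag{2.4}$$

lead to

$$\widetilde{\mathrm{pred}}_k(\tilde{s}) = -\tilde{g}_k^T \tilde{s} - \tfrac{1}{2} \tilde{s}^T \tilde{B}_k \tilde{s}, \tag{2.5}$$

$$\widetilde{\mathrm{ared}}_k(\tilde{s}) = f(x_k) - f(x_k + D_k^{-1} \tilde{s}), \tag{2.6}$$

$$\tilde{g}_k = D_k^{-T} g_k, \tag{2.7}$$

and

$$\tilde{B}_k = D_k^{-T} B_k D_k^{-1}. \tag{2.8}$$

In this notation, (1.15) becomes the simpler problem of finding $\tilde{s}^i$ that approximately solves

$$\underset{\tilde{s} \in \mathbb{R}^n}{\text{minimize}} \; \tilde{\psi}_k(\tilde{x}_k + \tilde{s}) : \|\tilde{s}\| \le \Delta^i. \tag{2.9}$$

The step $s^i$ can then be recovered by inverting transformation (2.1) to give

$$s^i = D_k^{-1} \tilde{s}^i. \tag{2.10}$$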
Typically, methods for calculating $s^i$ use (2.9) and (2.10) rather than (1.15), although the change of variables (2.1) need not be explicitly performed. Consider the relationship between the method of steepest descent, which takes

$$s^i = -\alpha g_k \tag{2.11}$$

for some positive $\alpha$, and the preconditioned steepest descent method, which uses a positive definite preconditioning matrix $C_k \in \mathbb{R}^{n \times n}$ and sets

$$s^i = -\alpha C_k^{-1} g_k \tag{2.12}$$

for some positive $\alpha$. Although this preconditioning does not explicitly use scaling (2.1), applying the method of steepest descent to the scaled problem (2.9) yields

$$\tilde{s}^i = -\alpha \tilde{g}_k \tag{2.13}$$

or

$$s^i = -\alpha (D_k^T D_k)^{-1} g_k \tag{2.14}$$

so that (2.12) implicitly uses a change of variables for which $D_k^T D_k = C_k$.

The matrices $D_k$ are often assumed to be diagonal in trust region literature. Because of the relationship between scaling and preconditioning, we prefer not to make this assumption, as nondiagonal preconditioners are widely used in conjugate direction methods for large scale problems.
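The equivalence between steepest descent in the scaled variables and preconditioned steepest descent with $C_k = D_k^T D_k$ can be checked numerically. The sketch below does so for one arbitrarily chosen nondiagonal $2 \times 2$ scaling matrix; the matrix, the vector, the step length, and the helper names are all illustrative assumptions, and the linear systems are solved by Cramer's rule only to keep the example self-contained.

```python
def solve2(A, b):
    """Solve the 2x2 linear system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def transpose2(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]


def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


D = [[2.0, 1.0], [0.0, 3.0]]     # nonsingular, nondiagonal scaling matrix
g = [1.0, 2.0]                   # model gradient g_k
alpha = 0.5                      # positive step length

# Steepest descent in the scaled variables, then map back to s.
g_tilde = solve2(transpose2(D), g)            # g~ = D^{-T} g
s_tilde = [-alpha * c for c in g_tilde]       # s~ = -alpha g~
s_scaled = solve2(D, s_tilde)                 # s  = D^{-1} s~

# Preconditioned steepest descent with C = D^T D.
C = matmul2(transpose2(D), D)
s_precond = [-alpha * c for c in solve2(C, g)]

print(s_scaled, s_precond)       # the two steps agree up to rounding
```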

2.3. Asymptotic behavior of step directions. The first property that we will need concerns the direction that trial steps $s^i$ tend toward as the trial trust radii tend toward zero. This property is, obviously, directly dependent on the method used to compute the trial steps. Let $\Theta^i$ be defined to be the angle between $\tilde{s}^i$ and $-\tilde{g}_k$ so that

$$\cos \Theta^i = -(\tilde{s}^i)^T \tilde{g}_k / (\|\tilde{s}^i\| \, \|\tilde{g}_k\|). \tag{2.15}$$

We will show that for the two major classes of solution techniques, if $\Delta^i \to 0$ in the inner loop of Algorithm (3) and $g_k \ne 0$, then $\cos \Theta^i \to 1$. Furthermore, let $\Theta_k$ be the angle between $\tilde{s}_k$ and $-\tilde{g}_k$. If an infinite sequence of successful iterates is generated and $\limsup_{k \to \infty} \|B_k\| < \infty$, then $\Delta_k \to 0$ and $\liminf_{k \to \infty} \|g_k\| > 0$ imply that $\cos \Theta_k \to 1$.
One of the major classes of solution methods is based on the following powerful result.⁹

THEOREM 2.1. Let $g$ be a vector in $\mathbb{R}^n$, let $B \in \mathbb{R}^{n \times n}$ be symmetric, and let $D \in \mathbb{R}^{n \times n}$ be nonsingular. A vector $s \in \mathbb{R}^n$ is a global solution to

$$\text{minimize} \; g^T s + \tfrac{1}{2} s^T B s : \|D s\| \le \Delta \tag{2.16}$$

if and only if $s$ and $\Delta$ obey the following relations for some $\mu \ge 0$:

$$(B + \mu D^T D) s = -g, \tag{2.17}$$

$$\|D s\| \le \Delta, \tag{2.18}$$

$$\mu (\Delta - \|D s\|) = 0, \tag{2.19}$$

and

$$B + \mu D^T D \text{ is positive semidefinite.} \tag{2.20}$$

Furthermore, if $B + \mu D^T D$ is positive definite, then (2.16) has a unique global solution.

Theorem 2.1 is unusually strong in that it completely characterizes all of the global solutions of problem (2.16), and simultaneously suggests an approach to approximately solving the trust region subproblem for a prespecified $\Delta^i$. Consider any $\mu \ge 0$ which is sufficiently large to make $B + \mu D^T D$ positive semidefinite, and let $s(\mu)$ be a solution to (2.17). Furthermore, define $\Delta(\mu) = \|D s(\mu)\|$ so that (2.18) and (2.19) are satisfied. We see that $s(\mu)$ exactly solves (2.16) for¹⁰ $\Delta = \Delta(\mu)$ and hence one possible approach to solving the trust region subproblem is to use some sort of procedure to find a $\mu^i$ for which $\Delta(\mu^i) \approx \Delta^i$. If, for example, a $\mu^i$ is found for which $\Delta(\mu^i) = (1 + \epsilon) \Delta^i$ for some small $\epsilon$, then $s(\mu^i)$ is an exact solution to the problem

$$\text{minimize} \; g^T s + \tfrac{1}{2} s^T B s : \|D s\| \le (1 + \epsilon) \Delta^i. \tag{2.21}$$

Methods of this type are sometimes called optimal locally constrained [8], or OLC methods. Such methods approximately solve the trust region subproblem (1.15) by exactly solving the nearby problem (expressed here in scaled form):

$$\text{minimize} \; \tilde{g}_k^T \tilde{s} + \tfrac{1}{2} \tilde{s}^T \tilde{B}_k \tilde{s} : \|\tilde{s}\| \le \hat{\Delta}^i \tag{2.22}$$

with

$$c_1 \Delta^i \le \hat{\Delta}^i \le c_2 \Delta^i \tag{2.23}$$

for some constants¹¹ $c_1 \in (0, 1]$ and $c_2 \in [1, 2)$. Trial steps therefore satisfy

$$\|\tilde{s}^i\| \le c_2 \Delta^i. \tag{2.24}$$

⁹ This well known result is founded on work done by Goldfeld, Quandt, and Trotter [9], and was first stated in modern form by Gay [8] and Sorensen [15]. The reader is referred to [10] for a more complete history and discussion.

¹⁰ In fact, if $\mu = 0$ and $B$ is symmetric positive definite, $s(\mu)$ exactly solves (2.16) for every $\Delta \ge \Delta(\mu)$. That is, $\mu = 0$ corresponds to the constraint not being binding.

A large number of efficient techniques can be found in the literature for finding a satisfactory $\mu^i$. Experimental results have been published for several implementations [10] in which the average number of matrix factorizations (of $B + \mu D^T D$) required to find an acceptable $\mu^i$ is roughly 1.5.

We have now characterized OLC methods sufficiently to examine the directional behavior of $s^i$ as $\Delta^i \to 0$.
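To make the idea of searching for a satisfactory $\mu^i$ concrete, the sketch below uses plain bisection on $\mu$. It is not one of the efficient techniques referred to above: it assumes $D = I$ and a diagonal $B$ with positive entries, so that $s(\mu)$ is available in closed form and $\Delta(\mu) = \|s(\mu)\|$ decreases monotonically in $\mu$. The function name and arguments are hypothetical; the defaults $c_1 = 0.9$ and $c_2 = 1.1$ are the typical choice mentioned in the footnotes.

```python
import math


def olc_step_diagonal(g, b_diag, delta, c1=0.9, c2=1.1, max_bisect=100):
    """Find mu >= 0 with c1*delta <= ||s(mu)|| <= c2*delta by bisection.

    Assumes D = I and B = diag(b_diag) with all b_diag > 0.
    Returns the pair (s(mu), mu).
    """
    def step(mu):
        # Solves (B + mu*I) s = -g componentwise.
        return [-gj / (bj + mu) for gj, bj in zip(g, b_diag)]

    def length(mu):
        return math.sqrt(sum(c * c for c in step(mu)))

    # mu = 0 is acceptable whenever the unconstrained minimizer is
    # (nearly) inside the trust region: the constraint is not binding.
    if length(0.0) <= c2 * delta:
        return step(0.0), 0.0

    # Bracket: ||s(lo)|| > c2*delta and ||s(hi)|| <= c2*delta.
    lo, hi = 0.0, 1.0
    while length(hi) > c2 * delta:
        lo, hi = hi, 2.0 * hi

    for _ in range(max_bisect):
        mu = 0.5 * (lo + hi)
        d = length(mu)
        if c1 * delta <= d <= c2 * delta:
            return step(mu), mu
        if d > c2 * delta:
            lo = mu
        else:
            hi = mu
    return step(hi), hi          # fallback: still satisfies the upper bound


s, mu = olc_step_diagonal([1.0, 2.0], [1.0, 4.0], 0.1)
print(s, mu)
```

Each bisection step here costs one closed-form solve; for a general $B$ it would instead require a factorization of $B + \mu D^T D$, which is why the factorization count quoted above matters in practice.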

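THEOREM 2.2. For $k = 1, 2, \ldots, k_{\max} \le \infty$, let $\{g_k\}$ be a set of vectors in $\mathbb{R}^n$ and let $\{B_k\}$ be a set of symmetric matrices in $\mathbb{R}^{n \times n}$. Let $\{\Delta^i\}$ be a sequence of positive numbers with $\{\Delta_k\} \subset \{\Delta^i\}$, let $s^i \in \mathbb{R}^n$ be calculated by an OLC method, and define $\Theta^i$ to be the angle between the vectors $\tilde{s}^i$ and $-\tilde{g}_k$. We then have the following.

(i) For fixed $k$, either $g_k = 0$ or

$$\lim_{i \to \infty} \Delta^i = 0 \;\Longrightarrow\; \lim_{i \to \infty} \cos(\Theta^i) = 1. \tag{2.25}$$

(ii) Suppose $\{\Delta_k\}$ is an infinite sequence and let $\{s_k\}$ be the subsequence of $\{s^i\}$ associated with $\{\Delta_k\}$. Define $\Theta_k$ to be the angle between $\tilde{s}_k$ and $-\tilde{g}_k$. If $\limsup_{k \to \infty} \|B_k\| < \infty$, then either $\liminf_{k \to \infty} \|g_k\| = 0$ or

$$\lim_{i \to \infty} \Delta^i = 0 \;\Longrightarrow\; \lim_{i \to \infty} \cos(\Theta^i) = 1 \;\Longrightarrow\; \lim_{k \to \infty} \cos \Theta_k = 1. \tag{2.26}$$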

The proof of this theorem is given in the appendix of this paper since it is rather unenlightening. It should be pointed out that (i) above is well known but is seldom stated in the literature, since standard convergence theory with $g_k = \nabla f(x_k)$ can be established without invoking (2.25).¹²

¹¹ A typical choice is $c_1 = 0.9$ and $c_2 = 1.1$.

The major alternative to OLC methods for computing approximate solutions to (2.9) is a class of techniques that we will refer to as generalized dogleg methods. The oldest and simplest such method is Powell's dogleg algorithm [11]. This method¹³ defines a piecewise linear path $\tilde{s}(\alpha)$ starting at $\tilde{s} = 0$, extending to the Cauchy step

$$\tilde{s}_k^{cp} = -\frac{\tilde{g}_k^T \tilde{g}_k}{\tilde{g}_k^T \tilde{B}_k \tilde{g}_k} \, \tilde{g}_k, \tag{2.27}$$

and then proceeding to the quasi-Newton step

$$\tilde{s}_k^{qn} = -\tilde{B}_k^{-1} \tilde{g}_k \tag{2.28}$$

(assuming for the moment that $\tilde{B}_k$ is positive definite).

¹² Although it is possible to prove many of our results without using Theorem 2.2, such analysis requires $\zeta \ll 1 - \eta_2$ in some cases.

¹³ Powell originally only considered $D_k = I$. The more general form given here is sometimes called the preconditioned dogleg.

If $\tilde{s}_k^{qn}$ is inside the trust region, then $\tilde{s}^i$ is taken to be $\tilde{s}_k^{qn}$. Otherwise the dogleg method sets $\tilde{s}^i$ as the intersection of $\tilde{s}(\alpha)$ with the surface of the trust region. In either event, $\tilde{s}^i$ minimizes $\tilde{\psi}_k(\tilde{x}_k + \tilde{s}(\alpha)) : \|\tilde{s}(\alpha)\| \le \Delta^i$. This method has the advantage of requiring only one matrix factorization per major iteration: once $\tilde{s}_k^{qn}$ has been calculated, computation of $\tilde{s}^i$ for any given $\Delta^i$ is trivial. Figure 1 illustrates trial steps computed by the dogleg method for different values of $\Delta$. It is clear that, for sufficiently small $\Delta^i$, the trial step $\tilde{s}^i$ is a positive multiple of $-\tilde{g}_k$, the direction of steepest descent for the model.
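The dogleg construction can be sketched as follows, working directly in the scaled variables (equivalently, taking $D_k = I$). The sketch assumes a symmetric positive definite model Hessian and takes the quasi-Newton step as an input, reflecting the remark that it only needs to be computed once per major iteration; the function and argument names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def dogleg_step(g, B, s_qn, delta):
    """Dogleg trial step for the model g^T s + 0.5 s^T B s, ||s|| <= delta.

    g     -- model gradient
    B     -- symmetric positive definite matrix (list of rows)
    s_qn  -- precomputed quasi-Newton step -B^{-1} g
    delta -- trust radius
    """
    if norm(s_qn) <= delta:
        return list(s_qn)                     # quasi-Newton step is inside
    Bg = [dot(row, g) for row in B]
    s_cp = [-(dot(g, g) / dot(g, Bg)) * c for c in g]   # Cauchy step
    if norm(s_cp) >= delta:
        # First leg leaves the region: step is a multiple of -g.
        return [-(delta / norm(g)) * c for c in g]
    # Second leg: find t in (0, 1) with ||s_cp + t (s_qn - s_cp)|| = delta.
    d = [q - c for q, c in zip(s_qn, s_cp)]
    a = dot(d, d)
    b = 2.0 * dot(s_cp, d)
    c = dot(s_cp, s_cp) - delta ** 2          # negative on this branch
    t = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [ci + t * di for ci, di in zip(s_cp, d)]


B = [[1.0, 0.0], [0.0, 4.0]]
g = [1.0, 2.0]
s_qn = [-1.0, -0.5]                           # solves B s = -g
for delta in (2.0, 0.9, 0.1):
    print(delta, dogleg_step(g, B, s_qn, delta))
```

For the smallest radius in the usage loop the returned step is a positive multiple of $-g$, which is the behavior the asymptotic results of this section rely on.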

Other methods exist in the literature which compute approximate solutions to the trust region subproblems by minimizing over a piecewise linear path. The double dogleg of Dennis and Mei [5] uses a path with one extra "leg" in order to give a larger bias toward the quasi-Newton direction $-\tilde{B}_k^{-1} \tilde{g}_k$. Steihaug [17] uses a dogleg path defined by the steps generated by a conjugate gradient method (with preconditioner $D_k^T D_k$) applied to the problem $B_k s = -g_k$. Other dogleg methods exist (see, for example, [14]) that take advantage of negative curvature in $\psi_k$ (i.e., $B_k$ need not be positive definite). All of these methods use $\tilde{s}_k^{cp}$ as the initial segment of the dogleg and define $\tilde{s}(\alpha)$ such that $\|\tilde{s}(\alpha)\|$ is increasing so that the intersection of $\tilde{s}(\alpha)$ and the surface of a trust region will be unique. We can therefore state the following.

PROPOSITION 2.3. The conclusions of Theorem 2.2 remain valid if each $s^i$ is computed by a generalized dogleg method rather than an OLC method.

Proof. This proposition follows immediately from (2.27), the uniqueness of $\tilde{s}(\alpha) \cap \{\tilde{s} : \|\tilde{s}\| = \Delta^i\}$, and the hypotheses of Theorem 2.2. □
2.4. The uniform predicted decrease condition. A technical condition concerning trial steps computed by OLC or generalized dogleg methods that is of great use in proving global convergence is the uniform predicted decrease¹⁴ (UPD) condition:

$$\mathrm{pred}_k(s^i) \ge c_3 \, \|g_k\| \min\left\{ \Delta^i, \frac{\|g_k\|}{\sigma_1} \right\} \tag{2.29}$$

for some constants $c_3 \in (0, 1]$ and $\sigma_1 \in (0, \infty)$. A complete discussion of this condition is not necessary for the purposes of this paper, and we merely give the well known¹⁵ result that OLC and generalized dogleg methods satisfy (2.29) provided¹⁶

$$\|B_k\| \le \sigma_1 \quad \forall\, k. \tag{2.30}$$


3.  Successful  termination  of  the  inner  loop. 


3.1.  Introduction.  Using  the  properties  of  trial  steps  described  in  the  last  section,  it  is  easy  to 
generate  examples  for  which  the  inner  loop  of  Algorithm  (3)  will  fail  to  find  an  acceptable  step  in  a 
finite  number  of  iterations. 

¹⁴ The term "uniform" is used because of the uniform bound $\|B_k\| \le \sigma_1 \; \forall\, k$ as opposed to, say, bounds of the form $\|B_k\| \le \sigma_1 (1 + k) \; \forall\, k$.

¹⁵ See, for example, [2], [8], [10], [11], and [14].

¹⁶ In [2] it is argued that the weaker assumption of a uniform upper bound on $\{g_k^T B_k g_k / g_k^T g_k\}$ is to be preferred, since: (a) this also implies the UPD condition, (b) natural methods exist for enforcing this weaker condition, and (c) numerical testing of these safeguarding techniques has shown that they can dramatically improve the reliability of a standard method without decreasing the overall efficiency of the algorithm. The best one of these methods is probably of limited utility for models with $g_k \ne \nabla f(x_k)$ because it makes use of first order differences in "$g(x)$" to safeguard the model Hessian, but an alternate safeguarding technique using second order differences in $f$ is also shown in [2] to improve reliability.

Example 3.1: $g_k$ not a descent direction. Define $f(x) = \tfrac{1}{2} x^T x$ and select any nonzero $x_0$. We have $\mathrm{ared}_k(s) = \tfrac{1}{2} x_0^T x_0 - \tfrac{1}{2} (x_0 + s)^T (x_0 + s) = -\nabla f(x_0)^T s - \tfrac{1}{2} s^T s$. Now suppose that $g_0 = -\nabla f(x_0)$, $B_0 = I$, and $D_0 = I$. We have that $\mathrm{pred}_k(s) = \nabla f(x_0)^T s - \tfrac{1}{2} s^T s$ and

$$\rho^i = \frac{\mathrm{ared}_k(s^i)}{\mathrm{pred}_k(s^i)} = \frac{-\nabla f(x_0)^T s^i - \tfrac{1}{2} (s^i)^T (s^i)}{\nabla f(x_0)^T s^i - \tfrac{1}{2} (s^i)^T (s^i)}. \tag{3.1}$$

For simplicity, suppose that each $s^i$ is being computed by a dogleg procedure so that $s^i = -\Delta^i g_0 / \|g_0\|$ for any $\Delta^i \le \|g_0\|$. Substituting into (3.1) gives

$$\rho^i = \frac{-\|g_0\| - \tfrac{1}{2} \Delta^i}{\|g_0\| - \tfrac{1}{2} \Delta^i}. \tag{3.2}$$

Hence, if $\Delta^0 < \|g_0\|$, $\rho^i < 0$ for any $\Delta^i \le \Delta^0$, and the inner loop of Algorithm (3) will never terminate. Similar examples can be shown for any gradient approximation and scaling matrix which do not satisfy $(D_k^{-T} g_k)^T (D_k^{-T} \nabla f(x_k)) > 0$.
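Example 3.1 is easy to reproduce numerically. The sketch below evaluates the ratio of actual to predicted reduction directly from the definitions for a sequence of shrinking trust radii; the particular starting point and the function name are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rho_example_3_1(x0, delta):
    """Ratio ared/pred for f(x) = 0.5 x^T x with g0 = -grad f(x0), B0 = D0 = I."""
    grad = list(x0)                          # grad f(x0) = x0
    g0 = [-c for c in grad]                  # approximation with the wrong sign
    g_norm = math.sqrt(dot(g0, g0))
    s = [-delta * c / g_norm for c in g0]    # first dogleg leg, delta <= ||g0||
    trial = [a + b for a, b in zip(x0, s)]
    ared = 0.5 * dot(x0, x0) - 0.5 * dot(trial, trial)
    pred = -dot(g0, s) - 0.5 * dot(s, s)
    return ared / pred


x0 = [3.0, 4.0]                              # ||g0|| = 5
ratios = [rho_example_3_1(x0, 2.0 ** (-j)) for j in range(10)]
print(ratios)                                # every entry is negative
```

Every ratio is negative and the values approach $-1$ as the radius shrinks, so no amount of radius reduction produces an acceptable step.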

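Example 3.2: $\|g_k\| \gg \|\nabla f(x_k)\|$. Even if $-(D_k^T D_k)^{-1} g_k$ is always a descent direction, the inner loop of Algorithm (3) is not assured of success. Consider the last example with $g_0$ taken to be $\frac{2}{\eta_1} \nabla f(x_0)$. Again taking $s^i = -\Delta^i g_0 / \|g_0\|$, we have

$$\rho^i = \frac{-\nabla f(x_0)^T s^i - \tfrac{1}{2} (s^i)^T (s^i)}{-g_0^T s^i - \tfrac{1}{2} (s^i)^T (s^i)}. \tag{3.3}$$

Then for any $\Delta^i \le \Delta^0 < \min\{\|g_0\|, \|\nabla f(x_0)\|\}$ we have

$$\rho^i = \tfrac{1}{2} \eta_1 \, \frac{\|\nabla f(x_0)\| - \tfrac{1}{2} \Delta^i}{\|\nabla f(x_0)\| - \tfrac{1}{4} \eta_1 \Delta^i} < \eta_1, \tag{3.4}$$

so that the inner loop of Algorithm (3) will never find a successful iterate.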
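The second example can be checked the same way. The sketch below assumes the approximation $g_0 = (2/\eta_1) \nabla f(x_0)$ for $f(x) = \tfrac{1}{2} x^T x$ with $B_0 = D_0 = I$, and uses $\eta_1 = 0.001$, the typical value quoted earlier; the starting point and function name are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rho_example_3_2(x0, delta, eta1=0.001):
    """Ratio ared/pred when g0 points the right way but is far too long."""
    grad = list(x0)                               # grad f(x0) = x0
    g0 = [(2.0 / eta1) * c for c in grad]         # ||g0|| >> ||grad f(x0)||
    g_norm = math.sqrt(dot(g0, g0))
    s = [-delta * c / g_norm for c in g0]         # step along -g0
    trial = [a + b for a, b in zip(x0, s)]
    ared = 0.5 * dot(x0, x0) - 0.5 * dot(trial, trial)
    pred = -dot(g0, s) - 0.5 * dot(s, s)
    return ared / pred


x0 = [3.0, 4.0]                                   # ||grad f(x0)|| = 5
eta1 = 0.001
ratios = [rho_example_3_2(x0, 2.0 ** (-j), eta1) for j in range(10)]
print(ratios)                   # positive, but always below eta1
```

Here every trial step does reduce $f$, yet the model predicts so much more reduction than is achieved that the acceptance test never passes.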


These examples, although rather extreme, clearly demonstrate that additional conditions must be imposed on the gradient approximation to assure the finite termination of the inner loop at every major iteration. It should be pointed out that this in no way contradicts Moré's result that consistency of the gradient approximations implies $\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0$ for Algorithm (1). Since his notation includes both inner and outer loops, hypothesis (1.5) becomes

$$(x_k \to x^*) \text{ or } (s^i \to 0 \text{ for fixed } k) \;\Longrightarrow\; \lim \|g^i - \nabla f(x_k)\| = 0 \tag{3.5}$$

in our notation, where $\{g^i\}$ is the set of approximations to $\nabla f(x_k)$ used in the inner loop. However, we prefer algorithms which keep a fixed approximation during the inner loop, and Moré's hypothesis cannot be applied directly to Algorithms (2) or (3).


3.2. Ensuring successful termination of the inner loop. In this section we show that if the relative error in the gradient approximation is less than $1 - \eta_1$ at a given iteration, then the inner loop of Algorithm (3) is assured of finding a successful new iterate. The following lemma will prove useful.

LEMMA 3.1. Let $f : R^n \to R$ be continuously differentiable on an open convex set $\Omega$ containing a point $x_k$, and let $\nabla f$ be Lipschitz continuous on $\Omega$ with constant $L \in (0, \infty)$. Let the functions $\mathrm{ared}_k(s)$ and $\mathrm{pred}_k(s)$ be defined as in (1.13) and (1.11). Let $\lambda_k^{\min} \in (-\infty, \infty)$ be the smallest eigenvalue of $B_k$ and let $\lambda_k^{\max} \in [\lambda_k^{\min}, \infty)$ be the largest. If the error in $g_k$ is defined to be

$$e_k = g_k - \nabla f(x_k), \tag{3.6}$$

then for all $s \in R^n$ such that $x_k + s \in \Omega$, we have

$$-\tfrac12 \|s\|^2 (L + \lambda_k^{\max}) - e_k^T s \ \le \ \mathrm{pred}_k(s) - \mathrm{ared}_k(s) \ \le \ \tfrac12 \|s\|^2 (L - \lambda_k^{\min}) - e_k^T s. \tag{3.7}$$


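Proof. We first use an integral representation of $\mathrm{ared}_k(s)$ to establish

$$\begin{aligned}
\mathrm{pred}_k(s) - \mathrm{ared}_k(s) &= -g_k^T s - \tfrac12 s^T B_k s + \int_0^1 \nabla f(x_k + \lambda s)^T s \, d\lambda \\
&= -e_k^T s - \tfrac12 s^T B_k s + \int_0^1 \big(\nabla f(x_k + \lambda s) - \nabla f(x_k)\big)^T s \, d\lambda.
\end{aligned} \tag{3.8}$$

Now, $\lambda_k^{\min} \|s\|^2 \le s^T B_k s \le \lambda_k^{\max} \|s\|^2$ and

$$\begin{aligned}
\left| \int_0^1 \big(\nabla f(x_k + \lambda s) - \nabla f(x_k)\big)^T s \, d\lambda \right| &\le \int_0^1 \|\nabla f(x_k + \lambda s) - \nabla f(x_k)\| \, \|s\| \, d\lambda \\
&\le \int_0^1 L \|\lambda s\| \, \|s\| \, d\lambda = \tfrac12 L \|s\|^2.
\end{aligned} \tag{3.9}$$

Substituting these bounds into (3.8) immediately establishes (3.7). □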

We can now establish the main result of this section. It ensures that a successful step can always be found provided the relative error in $g_k$ is less than $1 - \eta_1$.


THEOREM 3.2. Let $f : R^n \to R$ be continuously differentiable on an open convex set $\Omega$ containing a point $x_k$, and let $\nabla f$ be Lipschitz continuous on $\Omega$ with constant $L \in (0, \infty)$. Let $\mathrm{ared}_k(s)$, $\mathrm{pred}_k(s)$, $e_k$, and $\lambda_k^{\min}$ be defined as in Lemma 3.1, and let $D_k$ be any nonsingular matrix. Consider a sequence of trial iterates $\{s^i\}$ and associated trust radii $\{\Delta^i\}$ satisfying $\Delta^i \to 0$, $\mathrm{pred}_k(s^i) > 0 \ \forall i$, $\|s^i\| \le c_2 \Delta^i \ \forall i$, and $\lim_{\Delta^i \to 0} \cos \Theta^i = 1$, with $\Theta^i$ defined to be the angle between $D_k s^i$ and $-D_k^{-T} g_k$. If $g_k \ne 0$ and

$$\frac{\|e_k\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta \tag{3.10}$$

for some $\zeta \in [0, 1 - \eta_1)$, then for sufficiently small $\Delta^i$ we have

$$\rho^i = \frac{\mathrm{ared}_k(s^i)}{\mathrm{pred}_k(s^i)} \ge \eta_1. \tag{3.11}$$


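Proof. Assume without loss of generality that each $\Delta^i$ is sufficiently small to imply $x_k + s^i \in \Omega$. Since $\mathrm{pred}_k(s^i) > 0 \ \forall i$, Lemma 3.1 allows us to write

$$\begin{aligned}
1 - \rho^i &= \frac{\mathrm{pred}_k(s^i) - \mathrm{ared}_k(s^i)}{\mathrm{pred}_k(s^i)} \\
&\le \frac{\tfrac12 (L - \lambda_k^{\min}) \|s^i\|^2 - e_k^T s^i}{-g_k^T s^i - \tfrac12 (s^i)^T B_k (s^i)} \\
&= \frac{-(D_k^{-T} e_k)^T (D_k s^i) + \tfrac12 (L - \lambda_k^{\min}) \|s^i\|^2}{-(D_k^{-T} g_k)^T (D_k s^i) - \tfrac12 (s^i)^T B_k (s^i)}.
\end{aligned} \tag{3.12}$$

Using the Cauchy Schwarz inequality, some algebraic manipulations, and the definition

$$\cos(\Theta^i) = \frac{-(D_k^{-T} g_k)^T (D_k s^i)}{\|D_k^{-T} g_k\| \, \|D_k s^i\|}, \tag{3.13}$$

we can rewrite (3.12) as

$$\begin{aligned}
1 - \rho^i &\le \frac{\|D_k^{-T} e_k\| \, \|D_k s^i\| + \tfrac12 (L - \lambda_k^{\min}) \|s^i\|^2}{-(D_k^{-T} g_k)^T (D_k s^i) - \tfrac12 (s^i)^T B_k (s^i)} \\
&= \frac{\|D_k^{-T} e_k\| + \tfrac12 (L - \lambda_k^{\min}) \|s^i\|^2 / \|D_k s^i\|}{\|D_k^{-T} g_k\| \cos(\Theta^i) - \tfrac12 (s^i)^T B_k (s^i) / \|D_k s^i\|}.
\end{aligned} \tag{3.14}$$

Now,

$$\lim_{\Delta^i \to 0} \frac{\|s^i\|^2}{\|D_k s^i\|} = \lim_{\Delta^i \to 0} \frac{\|s^i\|^2}{\big[(s^i)^T D_k^T D_k (s^i)\big]^{1/2}} = 0 \tag{3.15}$$

and

$$\lim_{\Delta^i \to 0} \frac{(s^i)^T B_k (s^i)}{\|D_k s^i\|} = \lim_{\Delta^i \to 0} \frac{(s^i)^T B_k (s^i)}{\big[(s^i)^T D_k^T D_k (s^i)\big]^{1/2}} = 0, \tag{3.16}$$

so that by combining (3.14), (3.15), (3.16) and the hypothesis $\lim_{\Delta^i \to 0} \cos \Theta^i = 1$ we obtain

$$\lim_{\Delta^i \to 0} (1 - \rho^i) \le \frac{\|D_k^{-T} e_k\|}{\|D_k^{-T} g_k\|} = \frac{\|e_k\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}}. \tag{3.17}$$

By (3.10) the right-hand side is at most $\zeta$. Since $\zeta < 1 - \eta_1$, we therefore have $1 - \rho^i < 1 - \eta_1$ for sufficiently small $\Delta^i$. This establishes our result since $1 - \rho^i < 1 - \eta_1$ if and only if $\rho^i > \eta_1$. □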
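Condition (3.10) is straightforward to evaluate when the scaling matrix is diagonal. The following is a minimal sketch under stated assumptions: $D_k = \mathrm{diag}(d)$ is diagonal and nonsingular, and an estimate `e_est` of the error vector $e_k$ is available (in practice only an estimate or a bound would be known). The function names and the default choice of `zeta` are hypothetical.

```python
import math

def scaled_norm(y, d):
    """Norm of y induced by (D^T D)^{-1} for D = diag(d), i.e. ||D^{-T} y||."""
    return math.sqrt(sum((yi / di) ** 2 for yi, di in zip(y, d)))

def gradient_is_adequate(g, e_est, d, eta1, zeta=None):
    """Check the relative error bound (3.10) with zeta in [0, 1 - eta1).

    g      : gradient approximation g_k (must be nonzero)
    e_est  : estimate of e_k = g_k - grad f(x_k) (assumed to be available)
    d      : diagonal entries of the nonsingular scaling matrix D_k
    eta1   : step acceptance threshold of the trust region algorithm
    """
    if zeta is None:
        zeta = 0.9 * (1.0 - eta1)          # arbitrary illustrative choice
    if not 0.0 <= zeta < 1.0 - eta1:
        raise ValueError("zeta must lie in [0, 1 - eta1)")
    g_norm = scaled_norm(g, d)
    if g_norm == 0.0:
        raise ValueError("g_k = 0: recompute the gradient more accurately")
    return scaled_norm(e_est, d) / g_norm <= zeta
```

Note that the outcome depends on the scaling: an approximation that passes the test with $D_k = I$ can fail it under a different diagonal scaling, because the test is made in the norm induced by $(D_k^T D_k)^{-1}$.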


An immediate consequence of Theorem 3.2 is that Algorithm (3) will either generate an infinite sequence of iterates or terminate with $\nabla f(x_k) = g_k = 0$ provided (3.10) holds at every iteration. This can be formally stated as follows.

COROLLARY 3.3. Let $f : R^n \to R$ satisfy the standard assumptions and let $\{D_k\}$ be a sequence of nonsingular diagonal matrices. Then Algorithm (3), using any of the step computation techniques of Section 2, will either produce an infinite sequence of iterates satisfying $f(x_k) < f(x_{k-1})$ or will terminate at some iterate $x_k$ with $\nabla f(x_k) = 0$ provided the relative error in the gradient approximation satisfies

$$\frac{\|g_k - \nabla f(x_k)\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta < 1 - \eta_1 \tag{3.18}$$

at every iteration.

Proof. Since any acceptable iterate satisfies $\mathrm{pred}_k(s_k) > 0$ and $\rho_k > 0$, we have $f(x_k) < f(x_{k-1})$ for all (existing) iterates, and hence $x_k \in \Omega$ for all (existing) $x_k$. Now suppose Algorithm (3) succeeds in generating $x_0, x_1, \ldots, x_k$. If $g_k = 0$, (3.18) implies that $\nabla f(x_k) = 0$. Otherwise, the algorithm generates a trial step $s^i$ by the methods of Section 2. If this step satisfies $\rho^i \ge \eta_1$, then $x_{k+1}$ exists. If not, the inner loop of Algorithm (3) will try $\Delta^{i+1} \in (0, \gamma_1 \Delta^i]$, $\Delta^{i+2} \in (0, \gamma_1 \Delta^{i+1}]$, etc. as per step (a3) of the inner loop. Since $\gamma_1 < 1$, the conditions of Theorem 3.2 are satisfied, and hence $x_{k+1}$ exists. Our result follows by induction. □

Some remarks should be made concerning the possibility of $g_k = 0$ or $\nabla f(x_k) = 0$ for some iteration $k$. If $g_k = 0$, then (3.10) requires that $g_k = \nabla f(x_k)$. This is quite reasonable, in that if the approximate gradient indicates that $x_k$ is a stationary point of $f$, then the sensible procedure is to recompute $g_k$ with sufficient accuracy to verify or contradict that $\nabla f(x_k) = 0$. We also include no theory to affirm or deny the implementability of the algorithm beyond any iteration with $g_k = \nabla f(x_k) = 0$. Methods exist (see, for example, [14]) which are guaranteed to be repelled from such a stationary point if and only if it is a saddle point, but these methods assume that the model Hessian $B_k$ is the exact Hessian $\nabla^2 f(x_k)$. Since the use of a model with an approximate gradient and an exact Hessian appears somewhat unlikely, we prefer (for this paper) to say nothing about the existence of $x_{k+1}$ if $g_k = \nabla f(x_k) = 0$.

3.3. Relative error bound as an auxiliary to consistency. In one sense, Theorem 3.2 might be considered the main result of this paper in that using (3.10) as an auxiliary condition to (1.5) eliminates the major practical difficulty in directly enforcing (1.5). That is, (3.10) assures us that no further increases in the accuracy of the gradient approximation will be required in the inner loop. Enforcing (1.5) for the successful iterates is a lesser problem (even though conditions like (1.11) are still somewhat unsatisfying). Moreover, (3.10) is a sufficient condition for $\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0$ to imply $\liminf_{k \to \infty} \|\nabla f(x_k)\|_{(D_k^T D_k)^{-1}} = 0$. Consistency alone is not sufficient to establish this unless $\{x_k\}$ converges.

In Section 4, we show that consistency can be entirely replaced as a primary assumption by conditions on the relative error. However, for completeness we conclude this section by showing how using (3.10) as an auxiliary assumption to consistency allows the results of Moré to be strengthened.

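We first state the following lemma.

LEMMA 3.4. Let $\{D_k\}$ be a sequence of nonsingular matrices in $R^{n \times n}$ satisfying $\|D_k^T D_k\| \le (\sigma_2)^2$ and $\|(D_k^T D_k)^{-1}\| \le (\sigma_3)^2$ for some constants $\sigma_2, \sigma_3 < \infty$. Let $\{g_k\}$ and $\{\nabla f(x_k)\}$ be sequences in $R^n$ that satisfy either

$$\frac{\|g_k - \nabla f(x_k)\|}{\|g_k\|} \le \zeta < 1 \quad \forall k \tag{3.19}$$

or

$$\frac{\|g_k - \nabla f(x_k)\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta < 1 \quad \forall k. \tag{3.20}$$

Then the following are equivalent.

$$\liminf_{k \to \infty} \|g_k\| = 0. \tag{3.21}$$

$$\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0. \tag{3.22}$$

$$\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{3.23}$$

$$\liminf_{k \to \infty} \|\nabla f(x_k)\|_{(D_k^T D_k)^{-1}} = 0. \tag{3.24}$$

Furthermore, the following are also equivalent.

$$\lim_{k \to \infty} \|g_k\| = 0. \tag{3.25}$$

$$\lim_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0. \tag{3.26}$$

$$\lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{3.27}$$

$$\lim_{k \to \infty} \|\nabla f(x_k)\|_{(D_k^T D_k)^{-1}} = 0. \tag{3.28}$$
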

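Proof. We first notice that the conditions on $\{D_k\}$ imply

$$\frac{1}{\sigma_2} \|y\| \le \|y\|_{(D_k^T D_k)^{-1}} \le \sigma_3 \|y\| \tag{3.29}$$

and

$$\frac{1}{\sigma_3} \|y\|_{(D_k^T D_k)^{-1}} \le \|y\| \le \sigma_2 \|y\|_{(D_k^T D_k)^{-1}} \tag{3.30}$$

for all $y \in R^n$. This immediately implies (3.21) $\Leftrightarrow$ (3.22), (3.23) $\Leftrightarrow$ (3.24), (3.25) $\Leftrightarrow$ (3.26), and (3.27) $\Leftrightarrow$ (3.28). Now, if $\|e_k\| / \|g_k\| \le \zeta < 1 \ \forall k$, the triangle inequality gives $(1 - \zeta) \|g_k\| \le \|\nabla f(x_k)\| \le (1 + \zeta) \|g_k\|$, so we have that (3.21) $\Leftrightarrow$ (3.23) and (3.25) $\Leftrightarrow$ (3.27). If $\|e_k\|_{(D_k^T D_k)^{-1}} / \|g_k\|_{(D_k^T D_k)^{-1}} \le \zeta < 1$, then by the same argument (3.22) $\Leftrightarrow$ (3.24) and (3.26) $\Leftrightarrow$ (3.28). Linking all of these equivalences immediately establishes the lemma. □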
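The norm equivalence used at the start of this proof can be illustrated for a diagonal scaling matrix. The sketch below assumes $D = \mathrm{diag}(d)$ with nonzero entries, for which the smallest valid constants are $\sigma_2 = \max_i |d_i|$ and $\sigma_3 = 1 / \min_i |d_i|$; the function name and the floating point tolerance are hypothetical choices.

```python
import math

def check_norm_bounds(y, d):
    """Verify (3.29) for D = diag(d): ||y||/sigma2 <= ||y||_* <= sigma3*||y||,
    where ||y||_* is the norm induced by (D^T D)^{-1}."""
    sigma2 = max(abs(di) for di in d)         # ||D^T D|| = sigma2**2
    sigma3 = 1.0 / min(abs(di) for di in d)   # ||(D^T D)^{-1}|| = sigma3**2
    euclid = math.sqrt(sum(yi * yi for yi in y))
    scaled = math.sqrt(sum((yi / di) ** 2 for yi, di in zip(y, d)))
    tol = 1e-12 * max(1.0, scaled)            # guard against rounding error
    lower_ok = euclid / sigma2 <= scaled + tol
    upper_ok = scaled <= sigma3 * euclid + tol
    return lower_ok and upper_ok
```

Because both bounds hold with constants independent of $k$, a sequence tends to zero in one norm exactly when it does in the other, which is all the lemma needs from the scaling matrices.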

A hybrid of Theorem 3.2 and the global convergence results of Moré can now be stated.

THEOREM 3.5. Let $f : R^n \to R$ satisfy the standard assumptions, let $\{B_k\}$, $\{D_k^T D_k\}$ and $\{(D_k^T D_k)^{-1}\}$ be uniformly bounded, and let $\{x_k\}$ be the set of iterates produced by Algorithm (1) using any of the step computation techniques of Section 2. Let the gradient approximations satisfy the relative error bound (3.18), and assume that if the set of successful iterates is an infinite sequence, then the consistency condition (1.5) holds. Further assume that $x_{k+1} = x_k \Rightarrow g_{k+1} = g_k$. We then have that either

$$\liminf_{k \to \infty} \|g_k\| = \liminf_{k \to \infty} \|\nabla f(x_k)\| = 0, \tag{3.31}$$

or

$$g_k = \nabla f(x_k) = 0 \tag{3.32}$$

for some iterate $x_k$.

Proof. We first note from Corollary 3.3 that either $\nabla f(x_k) = g_k = 0$ for some iterate $x_k$, or Algorithm (1) generates an infinite sequence of successful iterates. If $\{x_k\}$ is an infinite sequence, then by hypothesis $\{g_k\}$ is consistent and Moré's [10] result that $\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0$ applies (our assumptions on $f$, $\{B_k\}$, $\{D_k\}$, and the step computation procedure are more than sufficient to imply the hypotheses used in [10]). Hence Lemma 3.4 implies that either (3.31) or (3.32) holds. □


4.  Global  convergence. 


4.1. Introduction. Although Theorem 3.5 shows that applying (3.10) as an auxiliary condition bypasses the largest practical difficulty with directly enforcing consistency, this theory is still less than satisfying because nothing is specified about how fast $\|e_k\|$ should be forced to zero as $\{x_k\}$ converges. If a condition such as $\|e_k\| \le c \|g_k\|$ is used, $c$ can be chosen to be any value in $[0, \infty)$. In Section 4.2, we establish the same global convergence results as in Theorem 3.5 without using consistency as a primary hypothesis. We instead use the condition

$$\frac{\|e_k\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta < 1 - \eta_2. \tag{4.1}$$

Since typical values for $\eta_2$ usually fall in [0.1, 0.25] and typical values for $\eta_1$ usually fall in [0.001, 0.1], condition (4.1) is only slightly more restrictive than (3.10).

In order to establish the strong result $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$, Moré [10] returns to the assumption that $g_k = \nabla f(x_k)$. This stronger assumption is not necessary. We show in Section 4.3 that bounding the relative error in the gradient approximation to be less than or equal to any constant $\zeta \in [0, 1)$ is sufficient to establish $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$. This strong global convergence result is often called first order stationary point convergence.

4.2. Replacing the consistency assumption with a relative error bound. We now show that the results of Theorem 3.5 remain true if the consistency assumption is replaced by (4.1).

THEOREM 4.1. Let $f : R^n \to R$ satisfy the standard assumptions, let $\{B_k\}$, $\{D_k^T D_k\}$ and $\{(D_k^T D_k)^{-1}\}$ be uniformly bounded, and let $\{x_k\}$ be the set of iterates produced by Algorithm (3) using any of the step computation techniques of Section 2. Let the gradient approximation satisfy the relative error bound (4.1). We then have that either

$$\liminf_{k \to \infty} \|g_k\| = \liminf_{k \to \infty} \|\nabla f(x_k)\| = 0 \tag{4.2}$$

or

$$g_k = \nabla f(x_k) = 0 \tag{4.3}$$

for some iterate $x_k$.

Proof. The central ideas in this proof are largely due to Powell [12], but we also draw heavily on the ideas used to prove Theorem 3.2.

(a) We first note that since $\zeta < 1 - \eta_2 \le 1 - \eta_1$, by Corollary 3.3 we have that either (4.3) is true or the algorithm generates an infinite sequence of successful iterates satisfying $f(x_k) < f(x_{k-1})$. Hence $x_k \in \Omega$ for all $k$.

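(b) Suppose $\{x_k\}$ is an infinite sequence but

$$\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} \ge \epsilon > 0. \tag{4.4}$$

From (1.16), (2.29), and the bounds on $\{B_k\}$ and $\{D_k\}$ we have

$$\mathrm{ared}_k(s_k) \ge \tfrac12 \eta_1 c_3 \|g_k\|_{(D_k^T D_k)^{-1}} \min\left\{ \Delta_k, \ \frac{1}{\sigma_1} \|g_k\|_{(D_k^T D_k)^{-1}} \right\} \tag{4.5}$$

for some $\sigma_1 \in (0, \infty)$, $\eta_1 \in (0, 1)$ and $c_3 \in (0, 1]$. Since $f$ is bounded below, (4.5) implies that $\Delta_k \to 0$ and hence $\Delta^i \to 0$. Assume without loss of generality that $k$ is sufficiently large to imply $\|g_k\|_{(D_k^T D_k)^{-1}} \ge \epsilon$ and $x_k + s^i \in \Omega$. From (3.14) we have

$$1 - \rho^i \le \frac{\|D_k^{-T} e_k\| / \|D_k^{-T} g_k\| + \tfrac12 (L - \lambda_k^{\min}) \|s^i\|^2 / (\|D_k^{-T} g_k\| \, \|D_k s^i\|)}{\cos \Theta^i - \tfrac12 (s^i)^T B_k (s^i) / (\|D_k^{-T} g_k\| \, \|D_k s^i\|)}. \tag{4.6}$$

From Theorem 2.2 and Proposition 2.3 we have that $\Delta^i \to 0 \Rightarrow \cos \Theta^i \to 1$ and since $\{B_k\}$ and $\{(D_k^T D_k)^{-1}\}$ are bounded, we can write

$$\lim_{k, i \to \infty} \frac{\|s^i\|^2}{\|D_k^{-T} g_k\| \, \|D_k s^i\|} \le \frac{1}{\epsilon} \lim_{k, i \to \infty} \frac{\|s^i\|^2}{\big[(s^i)^T D_k^T D_k (s^i)\big]^{1/2}} = 0 \tag{4.7}$$

and

$$\lim_{k, i \to \infty} \frac{(s^i)^T B_k (s^i)}{\|D_k^{-T} g_k\| \, \|D_k s^i\|} \le \frac{1}{\epsilon} \lim_{k, i \to \infty} \frac{(s^i)^T B_k (s^i)}{\big[(s^i)^T D_k^T D_k (s^i)\big]^{1/2}} = 0. \tag{4.8}$$

Furthermore, $\lambda_k^{\min}$ is bounded away from $-\infty$, so combining (4.6), (4.7), and (4.8) gives

$$\lim_{k, i \to \infty} (1 - \rho^i) \le \lim_{k \to \infty} \frac{\|D_k^{-T} e_k\|}{\|D_k^{-T} g_k\|} \le \zeta < 1 - \eta_2, \tag{4.9}$$

therefore there exists $\bar\imath$ such that $i \ge \bar\imath \Rightarrow 1 - \rho^i < 1 - \eta_2$, and hence $\rho^i > \eta_2$. But since no trust radius reduction is allowed if $\rho^i > \eta_2$, we have that $\liminf_{i \to \infty} \Delta^i > 0$, which implies $\liminf_{k \to \infty} \Delta_k > 0$, which is a contradiction. Thus if $\{x_k\}$ is an infinite sequence,

$$\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0. \tag{4.10}$$

The result (4.2) follows from (4.10) and Lemma 3.4. □
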


4.3.  First  order  stationary  point  convergence. 


4.3.1. Relative error measured in the Euclidean norm. The following theorem establishes first order stationary point convergence provided the sequence $\{x_k\}$ satisfies the weaker property $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$, the uniform predicted decrease condition holds, and $\|e_k\| / \|g_k\| \le \zeta < 1$. This is a very powerful result, as it allows us to obtain the strong result $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ directly from the previously established weak convergence property (4.2) without using any information concerning the trust radius updating procedure.

THEOREM 4.2. Let $f : R^n \to R$ satisfy the standard assumptions. Let $\{s_k\}$ be an infinite sequence of vectors which satisfy the (UPD) condition

$$\mathrm{pred}_k(s_k) \ge \tfrac12 c_3 \|g_k\|_{(D_k^T D_k)^{-1}} \min\left\{ \Delta_k, \ \frac{1}{\sigma_1} \|g_k\|_{(D_k^T D_k)^{-1}} \right\} \tag{4.11}$$

and

$$\mathrm{ared}_k(s_k) \ge \eta_1 \, \mathrm{pred}_k(s_k), \tag{4.12}$$

where $\Delta_k$ is a positive number satisfying

$$\|s_k\|_{D_k^T D_k} \le c_2 \Delta_k \tag{4.13}$$

with $c_3 \in (0, 1]$, $\sigma_1 \in (0, \infty)$, $\eta_1 \in (0, 1)$ and $c_2 \in [1, 2)$.

Let the sequence of scaling matrices $\{D_k\}$ satisfy

$$\|D_k^T D_k\| \le (\sigma_2)^2 \tag{4.14}$$

and

$$\|(D_k^T D_k)^{-1}\| \le (\sigma_3)^2 \tag{4.15}$$

for $\sigma_2, \sigma_3 \in (0, \infty)$. If

$$\frac{\|g_k - \nabla f(x_k)\|}{\|g_k\|} \le \zeta \tag{4.16}$$

for all $k$ with $\zeta \in [0, 1)$, then

$$\liminf_{k \to \infty} \|g_k\| = 0 \ \Rightarrow \ \lim_{k \to \infty} \|g_k\| = 0 \ \Rightarrow \ \lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{4.17}$$


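Proof. Define $\epsilon = \tfrac12 (1 - \zeta)/(1 + \zeta)$ and consider any iterate $x_m$ with nonzero $g_m$. Since $\liminf_{k \to \infty} \|g_k\| = 0$, there exists $\bar m \ge m$ for which $\|g_{\bar m + 1}\| < \epsilon \|g_m\|$ and $\|g_k\| \ge \epsilon \|g_m\|$ for all $k \in [m, \bar m]$. Now, from (4.11) and (4.12)

$$\begin{aligned}
f(x_m) - f(x_{\bar m + 1}) &= \sum_{k=m}^{\bar m} \mathrm{ared}_k(s_k) \ \ge \ \sum_{k=m}^{\bar m} \eta_1 \, \mathrm{pred}_k(s_k) \\
&\ge \sum_{k=m}^{\bar m} \tfrac12 \eta_1 c_3 \|g_k\|_{(D_k^T D_k)^{-1}} \min\left\{ \Delta_k, \ \frac{1}{\sigma_1} \|g_k\|_{(D_k^T D_k)^{-1}} \right\}.
\end{aligned} \tag{4.18}$$

Using the facts that $\|g_k\|_{(D_k^T D_k)^{-1}} \ge \frac{1}{\sigma_2} \|g_k\|$, $\Delta_k \ge \frac{1}{c_2} \|s_k\|_{D_k^T D_k} \ge \frac{1}{\sigma_3 c_2} \|s_k\|$, and $\|g_k\| \ge \epsilon \|g_m\|$ for all $k \in [m, \bar m]$, (4.18) can be transformed into

$$f(x_m) - f(x_{\bar m + 1}) \ge \tfrac12 \eta_1 c_3 \frac{\epsilon}{\sigma_2} \|g_m\| \sum_{k=m}^{\bar m} \min\left\{ \frac{\|s_k\|}{\sigma_3 c_2}, \ \frac{\epsilon}{\sigma_1 \sigma_2} \|g_m\| \right\}. \tag{4.19}$$

This can be divided into two cases.

(i) If $\frac{1}{\sigma_3 c_2} \|s_k\| \ge \frac{\epsilon}{\sigma_1 \sigma_2} \|g_m\|$ for at least one $k \in [m, \bar m]$, we have

$$f(x_m) - f(x_{\bar m + 1}) \ge \tfrac12 \eta_1 c_3 \left( \frac{\epsilon}{\sigma_2} \right)^2 \frac{1}{\sigma_1} \|g_m\|^2. \tag{4.20}$$

(ii) Otherwise,

$$f(x_m) - f(x_{\bar m + 1}) \ge \tfrac12 \eta_1 c_3 \frac{\epsilon}{\sigma_2} \frac{1}{\sigma_3 c_2} \|g_m\| \sum_{k=m}^{\bar m} \|s_k\|. \tag{4.21}$$

Now, in order to merge case (i) and case (ii), we need to establish a lower bound on $\sum_{k=m}^{\bar m} \|s_k\|$. From the triangle inequality we can write $\|g_m\| \le \|g_{\bar m + 1} - g_m\| + \|g_{\bar m + 1}\|$ and hence $\|g_m\| \le \|g_{\bar m + 1} - g_m\| + \epsilon \|g_m\|$. By rearranging terms, again applying the triangle inequality, invoking the Lipschitz continuity of $\nabla f$, and substituting in $e_k = g_k - \nabla f(x_k)$, we can obtain

$$\begin{aligned}
(1 - \epsilon) \|g_m\| &\le \|g_{\bar m + 1} - g_m\| \\
&\le \|\nabla f(x_{\bar m + 1}) - \nabla f(x_m)\| + \|e_{\bar m + 1} - e_m\| \\
&\le L \|x_{\bar m + 1} - x_m\| + \|e_{\bar m + 1} - e_m\| \\
&\le L \sum_{k=m}^{\bar m} \|s_k\| + \|e_{\bar m + 1}\| + \|e_m\|.
\end{aligned} \tag{4.22}$$

Substituting (4.22) into (4.21) yields

$$\begin{aligned}
f(x_m) - f(x_{\bar m + 1}) &\ge \frac{\tfrac12 \eta_1 c_3 \epsilon}{\sigma_2 \sigma_3 c_2 L} \|g_m\| \big[ (1 - \epsilon) \|g_m\| - \|e_{\bar m + 1}\| - \|e_m\| \big] \\
&= \frac{\tfrac12 \eta_1 c_3 \epsilon}{\sigma_2 \sigma_3 c_2 L} \|g_m\|^2 \left[ 1 - \epsilon - \frac{\|e_{\bar m + 1}\|}{\|g_{\bar m + 1}\|} \frac{\|g_{\bar m + 1}\|}{\|g_m\|} - \frac{\|e_m\|}{\|g_m\|} \right] \\
&\ge \frac{\tfrac12 \eta_1 c_3 \epsilon}{\sigma_2 \sigma_3 c_2 L} \|g_m\|^2 \big[ 1 - \epsilon - \zeta \epsilon - \zeta \big] \\
&= \tfrac12 \eta_1 c_3 \frac{\epsilon}{\sigma_2} \frac{1 - \zeta}{2 \sigma_3 c_2 L} \|g_m\|^2.
\end{aligned} \tag{4.23}$$

Hence for either case (i) or case (ii) we have that

$$f(x_m) - f(x_{\bar m + 1}) \ge \xi \|g_m\|^2, \tag{4.24}$$

where $\xi$ is the positive constant $\xi = \tfrac12 \eta_1 c_3 \frac{\epsilon}{\sigma_2} \min\left\{ \frac{\epsilon}{\sigma_2 \sigma_1}, \ \frac{1 - \zeta}{2 L \sigma_3 c_2} \right\}$.

Now, by hypothesis, $\{f(x_k)\}$ is nonincreasing and bounded below, so $\{f(x_k)\}$ must converge to some limit, say $f^*$. Thus, for any $m$, either $g_m = 0$ or

$$\|g_m\|^2 \le \big( f(x_m) - f(x_{\bar m + 1}) \big) / \xi \le \big( f(x_m) - f^* \big) / \xi. \tag{4.25}$$

Therefore $g_k \to 0$ and by Lemma 3.4, $\nabla f(x_k) \to 0$. □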
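For reference, the particular choice of $\epsilon$ made at the start of the proof is what lets the bracket in (4.23) collapse to a positive multiple of $1 - \zeta$. The short calculation, using only $\epsilon = \tfrac12 (1 - \zeta)/(1 + \zeta)$ and $\zeta \in [0, 1)$, is:

```latex
\begin{aligned}
1 - \epsilon - \zeta\epsilon - \zeta
  &= (1 - \zeta) - \epsilon\,(1 + \zeta) \\
  &= (1 - \zeta) - \tfrac12\,\frac{1 - \zeta}{1 + \zeta}\,(1 + \zeta) \\
  &= \tfrac12\,(1 - \zeta) \;>\; 0 .
\end{aligned}
```

Any $\epsilon$ strictly between $0$ and $(1 - \zeta)/(1 + \zeta)$ would keep the bracket positive; the factor $\tfrac12$ is simply a convenient choice.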


Condition (4.16) is a fairly natural condition, but it is slightly different from the condition used previously because it measures the relative error in the Euclidean norm while (1.10) measures it in the elliptical norm induced by $(D_k^T D_k)^{-1}$. In the next section we introduce a variation of Theorem 4.2 which uses (1.10).

4.3.2. Relative error measured in the norm induced by the scaling matrices. The following theorem establishes first order stationary point convergence under conditions similar to those of Theorem 4.2. There are, however, two differences. First, we assume $\|e_k\|_{(D_k^T D_k)^{-1}} / \|g_k\|_{(D_k^T D_k)^{-1}} \le \zeta < 1$ to be consistent with the theory in Section 4.2. Second, we impose an extra condition on the sequence of scaling matrices.

THEOREM 4.3. Let the hypotheses of Theorem 4.2 be satisfied, with the exception that (4.16) is replaced by

$$\frac{\|g_k - \nabla f(x_k)\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta \tag{4.26}$$

for all $k$ with $\zeta \in [0, 1)$. Let us further assume that there exists a constant $L \in (0, \infty)$ such that

$$\|D_{k+1}^{-T} \nabla f(x_{k+1}) - D_k^{-T} \nabla f(x_k)\| \le L \|s_k\| \tag{4.27}$$

for all $k$. We then have that

$$\liminf_{k \to \infty} \|g_k\| = 0 \ \Rightarrow \ \lim_{k \to \infty} \|g_k\| = 0 \ \Rightarrow \ \lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{4.28}$$

The proof of Theorem 4.3 is quite similar to that of Theorem 4.2, so we shall defer it until the Appendix.

Condition (4.27) is quite interesting. If a fixed scaling matrix $D$ is used rather than an adaptive scaling technique, (4.27) is automatically implied by the Lipschitz condition on $\nabla f$. Furthermore, simply assuming that $\{D_k^T D_k\}$ and $\{(D_k^T D_k)^{-1}\}$ are bounded is definitely not sufficient to imply (4.27). For example, if $\nabla f(x_{k+1}) = \nabla f(x_k)$, then $\|D_{k+1}^{-T} \nabla f(x_{k+1}) - D_k^{-T} \nabla f(x_k)\| = \|(D_{k+1}^{-T} - D_k^{-T}) \nabla f(x_k)\|$, which need not be bounded by a multiple of $\|s_k\|$.

Adaptive  scaling  is  poorly  understood  at  present.  Most  implementations  that  make  use  of  it 
generate  {Dk}  by  heuristic  methods  rather  than  procedures  with  a  firm  theoretical  basis.  Given 
this  lack  of  understanding,  theoretical  conditions  such  as  (4.27)  are  important  because  they  suggest 
guidelines  to  be  used  in  designing  methods  for  generating  scaling  matrices  {Dk}. 

An extension of our theory which might seem desirable would be a result analogous to Theorem 4.1 but with the relative error expressed in the Euclidean norm. Such a theorem would increase the symmetry between the results of Sections 4.2 and 4.3. Unfortunately, this conjecture is not true. Say, for example, $\nabla f(x_k) = (-\tfrac12, 1)^T$, $g_k = (\tfrac12, 1)^T$, $\eta_2 = 0.1$, and $D_k = \begin{pmatrix} \tfrac14 & 0 \\ 0 & 1 \end{pmatrix}$. Now, $e_k = (1, 0)^T$ and $\|e_k\| / \|g_k\| = \sqrt{4/5} < 0.9$, so that our condition $\|e_k\| / \|g_k\| \le \zeta < 1 - \eta_2$ is satisfied. However, the preconditioned steepest descent direction, $-(D_k^T D_k)^{-1} g_k$, is $-(8, 1)^T$. This is not a descent direction for $f$ since $(-\nabla f(x_k))^T (-8, -1)^T < 0$. Therefore, since $s^i$ tends in direction toward $-(D_k^T D_k)^{-1} g_k$ as $\Delta^i \to 0$, a sufficiently small $\Delta^0$ will imply $\mathrm{ared}_k(s^i) / \mathrm{pred}_k(s^i) < 0 \ \forall \Delta^i \le \Delta^0$.
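The arithmetic in this counterexample is easy to verify directly. The sketch below assumes the diagonal scaling $D_k = \mathrm{diag}(\tfrac14, 1)$, which is the simplest matrix consistent with the stated direction $-(8, 1)^T$; the variable names are illustrative only.

```python
import math

grad_f = [-0.5, 1.0]     # exact gradient at x_k
g = [0.5, 1.0]           # gradient approximation g_k
d = [0.25, 1.0]          # assumed diagonal of D_k, so (D^T D)^{-1} g = (8, 1)
eta2 = 0.1

# Error e_k = g_k - grad f(x_k) and its Euclidean relative size.
e = [gi - fi for gi, fi in zip(g, grad_f)]
rel_err = math.sqrt(sum(v * v for v in e)) / math.sqrt(sum(v * v for v in g))

# Preconditioned steepest descent direction -(D^T D)^{-1} g_k.
direction = [-gi / (di * di) for gi, di in zip(g, d)]

# Directional derivative of f along that direction; positive means ascent.
slope = sum(fi * pi for fi, pi in zip(grad_f, direction))
```

The Euclidean relative error stays below $1 - \eta_2$, yet `slope` is positive, so small steps along `direction` increase $f$ to first order even though the Euclidean test is passed.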

5.  Conclusion. 


5.1. Summary of results. The global convergence result $\liminf_{k \to \infty} \|g_k\|_{(D_k^T D_k)^{-1}} = 0$ has previously been shown for trust region algorithms that use inexact gradient values provided these approximations are consistent. We demonstrate, however, that for implementations that do not update $g_k$ on unsuccessful iterations, the algorithm may fail at a point $x_k$ with $g_k \ne 0$. This failure cannot occur if

$$\frac{\|e_k\|_{(D_k^T D_k)^{-1}}}{\|g_k\|_{(D_k^T D_k)^{-1}}} \le \zeta \tag{5.1}$$

and $\zeta \in [0, 1 - \eta_1)$. Furthermore, if (5.1) holds with $\zeta \in [0, 1 - \eta_2)$, the result $\liminf_{k \to \infty} \|g_k\| = \liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$ can be established without using consistency as a primary assumption¹⁷: consistency is instead a consequence of our theory. Finally, (5.1) also allows us to obtain the strong global convergence result $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ provided (4.27) holds.

Since many of the procedures used for generating gradient approximations simultaneously provide an error estimate, our results provide a practical criterion for deciding whether a given approximation is adequate.


¹⁷ This result uses mild assumptions on $f$ and assumes that $\{B_k\}$, $\{D_k^T D_k\}$ and $\{(D_k^T D_k)^{-1}\}$ are uniformly bounded.

5.2. Final remarks. Several possibilities suggest themselves for future study. One is to establish our results using alternative assumptions. Rather than taking $\|e_k\| / \|g_k\| \le \zeta$, we might try assumptions like

$$\frac{g_k^T \nabla f(x_k)}{\|g_k\| \, \|\nabla f(x_k)\|} \ge \theta_1 \tag{5.2}$$

or

$$\theta_2 \le \frac{\|\nabla f(x_k)\|}{\|g_k\|} \le \theta_3. \tag{5.3}$$

The two examples in Section 3.1 show that neither (5.2) nor (5.3) taken alone is sufficient to imply implementability, but some combination of similar assumptions might work. The existence (and utility) of such alternative assumptions is an open question.


Another topic for future research is to examine the local convergence rates of these methods. Steihaug [16] establishes q-superlinear convergence for a class of trust region algorithms assuming (among other things) that

$$\lim_{k \to \infty} \frac{\|\nabla f(x_k) + B_k s_k\|}{\|\nabla f(x_k)\|} = 0, \quad \text{or equivalently} \quad \lim_{k \to \infty} \frac{\|(g_k + B_k s_k) - e_k\|}{\|g_k - e_k\|} = 0.$$

The structural similarity between Steihaug's analysis and that of this paper suggests that q-superlinear convergence can be obtained if $\lim_{k \to \infty} \|e_k\| / \|g_k\| = 0$. This is probably an unrealistic assumption since gradient approximations are generally used only when exact (or almost exact) values are extremely expensive computationally, so an important question is the existence of less restrictive assumptions which imply fast local convergence.


6.  Appendix. 


6.1. Proof of Theorem 2.2. First notice that case (i) can be treated as a special instance of case (ii) by defining $g_k = g_{\bar k}$ and $B_k = B_{\bar k} \ \forall k \ge \bar k$. To prove case (ii), we recall that by Theorem 2.1, there exists a sequence of nonnegative numbers $\{\mu^i\}$ such that

$$(B_k + \mu^i I) s^i = -g_k \tag{6.1}$$

and

$$\|s^i\| \le c_2 \Delta^i \tag{6.2}$$

with $B_k + \mu^i I$ positive semidefinite. Applying the Cauchy Schwarz inequality to (6.1) gives

$$\|s^i\| \ge \|g_k\| / \|B_k + \mu^i I\|. \tag{6.3}$$

Suppose there exists $\epsilon > 0$ such that $\|g_k\| \ge \epsilon$ for all $k$ sufficiently large. Equation (6.3) and the hypothesis that $\{B_k\}$ is bounded establishes that

$$\Delta^i \to 0 \ \Rightarrow \ \mu^i \to \infty. \tag{6.4}$$

Now,

$$\cos \Theta^i = \frac{-(s^i)^T g_k}{\|s^i\| \, \|g_k\|} \tag{6.5}$$

so that by substituting $(B_k + \mu^i I) s^i$ for $-g_k$ in (6.5) and expanding the resulting terms we get

$$\begin{aligned}
\cos \Theta^i &= \frac{(s^i)^T (B_k + \mu^i I) s^i}{\|s^i\| \, \|(B_k + \mu^i I) s^i\|} \\
&= \frac{1 + \frac{1}{\mu^i} (s^i)^T B_k (s^i) / \|s^i\|^2}{\left[ 1 + \frac{2}{\mu^i} (s^i)^T B_k (s^i) / \|s^i\|^2 + \left( \frac{1}{\mu^i} \right)^2 (s^i)^T B_k^T B_k (s^i) / \|s^i\|^2 \right]^{1/2}}.
\end{aligned} \tag{6.6}$$

Hence by (6.2), (6.4), (6.6) and the hypotheses $\limsup_{k \to \infty} \|B_k\| < \infty$ and $\lim_{i \to \infty} \Delta^i = 0$, we have

$$\lim_{k, i \to \infty} \cos(\Theta^i) = 1. \quad \Box$$
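The limit in (6.6) can be observed numerically. The following sketch works in two dimensions with a symmetric, possibly indefinite $B$, solves $(B + \mu I) s = -g$ by Cramer's rule, and evaluates $\cos \Theta$ from (6.5); the function name and the sample data are hypothetical.

```python
import math

def cos_theta(B, g, mu):
    """cos of the angle between s and -g, where (B + mu*I) s = -g (2x2 case)."""
    a, b, c = B[0][0] + mu, B[0][1], B[1][1] + mu   # B symmetric: B[1][0] == b
    det = a * c - b * b
    # Solve the 2x2 linear system by Cramer's rule.
    s = [(-g[0] * c + g[1] * b) / det, (g[0] * b - g[1] * a) / det]
    dot = -(s[0] * g[0] + s[1] * g[1])
    return dot / (math.hypot(*s) * math.hypot(*g))

B = [[2.0, 1.0], [1.0, -1.0]]      # indefinite model Hessian
g = [1.0, 2.0]
mus = [10.0, 100.0, 1000.0, 1.0e6] # large enough that B + mu*I is positive definite
values = [cos_theta(B, g, mu) for mu in mus]
```

As $\mu$ grows the matrix $B + \mu I$ is dominated by $\mu I$, so the step aligns with $-g$ and the computed cosines approach 1, in line with (6.4) and (6.6).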


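6.2. Proof of Theorem 4.3. The proof of this theorem is quite similar to that of Theorem 4.2. Define

$$\epsilon = \tfrac12 (1 - \zeta)/(1 + \zeta) \tag{6.7}$$

and consider any iterate $x_m$ with nonzero $g_m$.

Since $\liminf_{k \to \infty} \|g_k\| = 0$, by Lemma 3.4 we have that $\liminf_{k \to \infty} \|D_k^{-T} g_k\| = 0$, and thus there exists $\bar m \ge m$ for which $\|D_{\bar m + 1}^{-T} g_{\bar m + 1}\| < \epsilon \|D_m^{-T} g_m\|$ and $\|D_k^{-T} g_k\| \ge \epsilon \|D_m^{-T} g_m\|$ for all $k \in [m, \bar m]$. Using equation (4.18) and the facts that $\Delta_k \ge \frac{1}{c_2} \|s_k\|_{D_k^T D_k} \ge \frac{1}{\sigma_3 c_2} \|s_k\|$ and $\|D_k^{-T} g_k\| \ge \epsilon \|D_m^{-T} g_m\| \ \forall k \in [m, \bar m]$, we can write

$$f(x_m) - f(x_{\bar m + 1}) \ge \tfrac12 \eta_1 c_3 \epsilon \|D_m^{-T} g_m\| \sum_{k=m}^{\bar m} \min\left\{ \frac{\|s_k\|}{\sigma_3 c_2}, \ \frac{\epsilon}{\sigma_1} \|D_m^{-T} g_m\| \right\}. \tag{6.8}$$

We then use the triangle inequality to show

$$\begin{aligned}
\|D_m^{-T} g_m\| &\le \|D_m^{-T} g_m - D_{\bar m + 1}^{-T} g_{\bar m + 1}\| + \|D_{\bar m + 1}^{-T} g_{\bar m + 1}\| \\
&\le \|D_m^{-T} g_m - D_{\bar m + 1}^{-T} g_{\bar m + 1}\| + \epsilon \|D_m^{-T} g_m\|
\end{aligned} \tag{6.9}$$

so that

$$(1 - \epsilon) \|D_m^{-T} g_m\| \le \|D_m^{-T} g_m - D_{\bar m + 1}^{-T} g_{\bar m + 1}\|. \tag{6.10}$$

Rearranging terms, defining $e_k = g_k - \nabla f(x_k)$, and again applying the triangle inequality allows us to write

$$\begin{aligned}
(1 - \epsilon) \|D_m^{-T} g_m\| &\le \|D_m^{-T} \nabla f(x_m) + D_m^{-T} e_m - D_{\bar m + 1}^{-T} \nabla f(x_{\bar m + 1}) - D_{\bar m + 1}^{-T} e_{\bar m + 1}\| \\
&\le \|D_m^{-T} \nabla f(x_m) - D_{\bar m + 1}^{-T} \nabla f(x_{\bar m + 1})\| + \|D_m^{-T} e_m\| + \|D_{\bar m + 1}^{-T} e_{\bar m + 1}\| \\
&= \Big\| \sum_{k=m}^{\bar m} \big( D_k^{-T} \nabla f(x_k) - D_{k+1}^{-T} \nabla f(x_{k+1}) \big) \Big\| + \|D_m^{-T} e_m\| + \|D_{\bar m + 1}^{-T} e_{\bar m + 1}\| \\
&\le \sum_{k=m}^{\bar m} \|D_k^{-T} \nabla f(x_k) - D_{k+1}^{-T} \nabla f(x_{k+1})\| + \|D_m^{-T} e_m\| + \|D_{\bar m + 1}^{-T} e_{\bar m + 1}\| \\
&\le L \sum_{k=m}^{\bar m} \|s_k\| + \|D_m^{-T} e_m\| + \|D_{\bar m + 1}^{-T} e_{\bar m + 1}\|.
\end{aligned} \tag{6.11}$$

Using (6.7), (6.11), and the inequality $\|D_{\bar m + 1}^{-T} g_{\bar m + 1}\| < \epsilon \|D_m^{-T} g_m\|$ gives

$$\begin{aligned}
\sum_{k=m}^{\bar m} \|s_k\| &\ge L^{-1} \big[ (1 - \epsilon) \|D_m^{-T} g_m\| - \|D_m^{-T} e_m\| - \|D_{\bar m + 1}^{-T} e_{\bar m + 1}\| \big] \\
&= L^{-1} \|D_m^{-T} g_m\| \left[ (1 - \epsilon) - \frac{\|D_m^{-T} e_m\|}{\|D_m^{-T} g_m\|} - \frac{\|D_{\bar m + 1}^{-T} e_{\bar m + 1}\|}{\|D_{\bar m + 1}^{-T} g_{\bar m + 1}\|} \frac{\|D_{\bar m + 1}^{-T} g_{\bar m + 1}\|}{\|D_m^{-T} g_m\|} \right] \\
&\ge L^{-1} \|D_m^{-T} g_m\| \big[ 1 - \epsilon - \zeta - \zeta \epsilon \big] \\
&= L^{-1} \|D_m^{-T} g_m\| \big[ 1 - \zeta - \epsilon (1 + \zeta) \big] \\
&= \|D_m^{-T} g_m\| \, \tfrac12 (1 - \zeta) / L.
\end{aligned} \tag{6.12}$$

Substituting this into (6.8) yields

$$f(x_m) - f(x_{\bar m + 1}) \ge \xi \|D_m^{-T} g_m\|^2 \tag{6.13}$$

where

$$\xi = \tfrac12 \eta_1 c_3 \epsilon \min\left\{ \frac{\epsilon}{\sigma_1}, \ \frac{1 - \zeta}{2 \sigma_3 c_2 L} \right\} > 0. \tag{6.14}$$

By hypothesis, $\{f(x_k)\}$ is nonincreasing and bounded below so that $f(x_k) \to f^*$ for some $f^*$. Thus for any $m$, either $g_m = 0$, or

$$\|D_m^{-T} g_m\|^2 \le \big( f(x_m) - f(x_{\bar m + 1}) \big) / \xi \le \big( f(x_m) - f^* \big) / \xi. \tag{6.15}$$

Therefore, $\lim_{k \to \infty} \|D_k^{-T} g_k\| = 0$, and by Lemma 3.4, $\lim_{k \to \infty} \|g_k\| = 0$ and $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$. □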